Abstract

The significance of the study of the theoretical and practical properties of
AdaBoost is unquestionable, given its simplicity, wide practical use, and effectiveness
on real-world datasets. Here we present a few open problems regarding the behavior
of “Optimal AdaBoost,” a term coined by Rudin, Daubechies, and Schapire
in 2004 to label the simple version of the standard AdaBoost algorithm in which
the weak learner always outputs the weak classifier with the lowest weighted error
in the weak learner's implicit hypothesis class. We concentrate on the standard,
“vanilla” version of Optimal AdaBoost for binary classification that results from
using an exponential-loss upper bound on the misclassification training error.
We present two types of open problems. One deals with general weak hypotheses.
The other deals with the particular case of decision stumps, as commonly used in
practice. Answers to the open problems can have an immediate, significant impact on
(1) cementing previously established results on asymptotic convergence properties
of Optimal AdaBoost for finite datasets, which in turn can be the start of any
convergence-rate analysis; (2) understanding the weak-hypothesis class of effective
decision stumps generated from data, which we have empirically observed to be
significantly smaller than the typically obtained class, as well as the effect on
the weak learner’s running time and on previously established improved bounds on
the generalization performance of Optimal AdaBoost classifiers; and (3) shedding
some light on the “self control” that AdaBoost tends to exhibit in practice.
1 Introduction


Due to space constraints, we concentrate on stating the open problems and conjectures
without entering into the details. We refer the reader to our manuscript on the
convergence properties of Optimal AdaBoost for additional details [Belanich and Ortiz,
2012], recently updated for presentation and clarification purposes. We also refer the
reader to that manuscript for further discussion of the important implications, briefly
listed in the Abstract, that answers to the open problems and conjectures stated here
would have.

We note that our main interest is not in highly synthetic, “low-dimensional” examples
that contradict the conjectures unless, of course, such examples are the simple start of
more sophisticated constructions of non-trivial and realistic counterexamples.


Technical Preliminaries and Notation. Let $\mathcal{X}$ denote the feature space (i.e., the
set of all inputs) and $\{-1,+1\}$ be the set of (binary) output labels. To simplify
notation, let $\mathcal{D} = \mathcal{X} \times \{-1,+1\}$ be the set of possible input-output pairs. In
typical AdaBoost, we want to learn from a given, fixed dataset of $m$ training examples
$D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, where each input-output pair
$(x^{(l)}, y^{(l)}) \in \mathcal{D}$, for all examples $l = 1, \ldots, m$. We make the standard assumption
that each example in $D$ comes i.i.d. from a probability space $(\mathcal{D}, \Sigma, P)$, where $\mathcal{D}$ is
the outcome space, $\Sigma$ is the ($\sigma$-algebra) set of possible events with respect to $\mathcal{D}$ (i.e.,
subsets of $\mathcal{D}$), and $P : \Sigma \to \mathbb{R}$ is the probability measure.

We denote the set of hypotheses that the weak learner that AdaBoost uses, or simply
the weak-hypothesis class, by $\mathcal{H}$. We say that $\mathcal{H}$ is AdaBoost-natural with respect to $D$
if (1) the hypothesis $h$, such that, for all $x \in \mathcal{X}$, $h(x) = 1$, is in $\mathcal{H}$; (2) if $h \in \mathcal{H}$, then
$-h \in \mathcal{H}$; and (3) for all $h \in \mathcal{H}$, there exists an $(x, y) \in D$ such that $h(x) \neq y$.

In our work, we study Optimal AdaBoost as a dynamical system of the weights $w$
over the examples in $D$, in a way similar to previous work [Rudin et al., 2004];
we also refer to such $w$ as the example or sample weights. In particular, we take a
dynamical-system view of the Optimal AdaBoost update rule of the example weights $w$
on $D$. That is, each $w$ is a probability distribution over the $m$ examples in $D$. The
set of all such $w$’s, denoted by $\Delta_m$, corresponds to the state space of the
AdaBoost-induced dynamical system. Denote by $(w_1, w_2, \ldots)$ the infinite sequence of example
weights that AdaBoost would generate if it were run infinitely (i.e., the total number
of rounds $T \to \infty$). We deviate slightly in the initialization of $w_1$, which is often
uniform over the set of examples: i.e., $w_1(l) = \frac{1}{m}$ for $l = 1, \ldots, m$. Instead, we let
$w_1 \sim \mathrm{Uniform}(\Delta_m)$.

We also assume that the weak learner has a deterministic tie-breaking rule; i.e.,
denoting $\mathrm{err}(h; w, D) = \sum_{l=1}^{m} w(l) \, \mathbb{1}[h(x^{(l)}) \neq y^{(l)}]$ and
$\mathcal{H}(w, D) = \arg\min_{h \in \mathcal{H}} \mathrm{err}(h; w, D)$,
for every example weight $w$ that AdaBoost could generate, the weak learner always
outputs the same weak hypothesis $h^* \in \mathcal{H}(w, D)$. We call $h^*$ the (weak-learner’s)
representative hypothesis of the set $\mathcal{H}(w, D)$. In addition, we assume that if the set
$\mathcal{H}(w, D) = \mathcal{H}(w', D)$ for any other $w' \neq w$, then the representative hypothesis of
$\mathcal{H}(w', D)$ is also $h^*$.
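
The dynamical-system view described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in our experiments: each weak hypothesis is represented by its tuple of predictions on the dataset, ties are broken by lowest index (one possible deterministic rule that picks the same representative whenever the set of minimizers is the same), the initial weights are drawn uniformly from the simplex by normalizing exponential draws, and the weight update is the standard AdaBoost reweighting written in closed form. All function names are hypothetical.

```python
import random

def weighted_error(preds, labels, w):
    """err(h; w, D): total weight on the examples that h misclassifies."""
    return sum(wl for p, yl, wl in zip(preds, labels, w) if p != yl)

def select_hypothesis(hyps, labels, w):
    """Return (index, error) of a minimum-error hypothesis.

    Assumed tie-breaking rule: the lowest index among the minimizers.
    """
    errs = [weighted_error(h, labels, w) for h in hyps]
    eps = min(errs)
    return errs.index(eps), eps

def update_weights(preds, labels, w, eps):
    """Standard AdaBoost reweighting in closed form.

    Misclassified examples are scaled by 1/(2*eps), the rest by
    1/(2*(1-eps)), so the chosen hypothesis has weighted error 1/2
    under the new weights. Requires 0 < eps < 1.
    """
    return [wl / (2.0 * eps) if p != yl else wl / (2.0 * (1.0 - eps))
            for p, yl, wl in zip(preds, labels, w)]

def sample_simplex(m, rng):
    """Draw w_1 uniformly from the simplex via normalized exponentials."""
    g = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(g)
    return [x / total for x in g]

def run_optimal_adaboost(hyps, labels, T, seed=0):
    """Iterate the weight map for T rounds; return chosen indices and final w."""
    rng = random.Random(seed)
    w = sample_simplex(len(labels), rng)
    chosen = []
    for _ in range(T):
        j, eps = select_hypothesis(hyps, labels, w)
        chosen.append(j)
        w = update_weights(hyps[j], labels, w, eps)
    return chosen, w
```

The sketch relies on the hypothesis class being AdaBoost-natural with respect to the dataset: condition (3) keeps the selected error strictly positive, and closure under negation keeps it at most 1/2, so the divisions are well defined.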
2 The No-Ties Conjecture

The following conjecture essentially states that Optimal AdaBoost eventually has no
ties in the selection of the best weak classifier at each round. We use this no-ties
condition to establish the convergence of the AdaBoost classifier, its generalization
error, and in fact, the time/per-round average of any $L^1$ measurable function of the
$w_t$’s generated by Optimal AdaBoost, which include the output classifier’s margin, the
example margins, as well as the weighted errors $\epsilon_t$ and the weak-hypothesis weights
$\alpha_t$ of the selected hypotheses $h_t$ at each round $t$.

We denote by $\mu$ both the Borel and the counting measure, as appropriate and clear
from context. In the statement below, we assume that the characterization of the set of
all $\mu$-probability spaces, and all $\mu$-measurable spaces over $\mathcal{H}$, each depend on their own
different set of parameters with Borel or counting measurable spaces, as appropriate for
the corresponding $\sigma$-algebras.

Conjecture 1 (No-Ties Conjecture) For $\mu$-almost all probability spaces $(\mathcal{D}, \Sigma, P)$
and any dataset $D \sim P$, and $\mu$-almost all $\mathcal{H}$ that are AdaBoost-natural with respect to
$D$, there exist $m', T' > 0$, such that if $m > m'$ is the size of $D$, then, $P$-almost surely,
either (1) for all $t > T'$ rounds of Optimal AdaBoost, either (1.a) $|\mathcal{H}(w_t, D)| = 1$; or
(1.b) for all pairs $h_t, h'_t \in \mathcal{H}(w_t, D)$, $h_t(x^{(l)}) = h'_t(x^{(l)})$ for all $l = 1, \ldots, m$; or (2)
$\lim_{t \to \infty} \sum_{l=1}^{m} w_t(l) \, \mathbb{1}[h_t(x^{(l)}) \neq h'_t(x^{(l)})] = 0$, where $h_t, h'_t \in \mathcal{H}(w_t, D)$.

Conjecture 2 (Measure-Zero-Decision-Boundary Conjecture) For $\mu$-almost all probability
spaces $(\mathcal{D}, \Sigma, P)$ and any dataset $D \sim P$, and $\mu$-almost all $\mathcal{H}$ that are
AdaBoost-natural with respect to $D$, there exist $m', T' > 0$, such that if $m > m'$, and $T > T'$
is the total number of rounds of Optimal AdaBoost, then the decision boundary of the
binary classifier that Optimal AdaBoost outputs after $T$ rounds when given dataset $D$
as input has $P$-measure zero.

In our work we employ tools from ergodic theory to establish our convergence results.
We provide a non-constructive proof of the existence of a measure for which the
Optimal-AdaBoost update is measure-preserving.

Conjecture 3 (Constructive-Proof Conjecture) For $\mu$-almost all probability spaces
$(\mathcal{D}, \Sigma, P)$ and any dataset $D \sim P$, and $\mu$-almost all $\mathcal{H}$ that are AdaBoost-natural
with respect to $D$, there exists $m' > 0$, such that if $m > m'$, then there exists a
constructive proof of existence of a measure for which the Optimal AdaBoost weight
update is measure-preserving, $P$-almost surely.


3 AdaBoosting Decision Stumps 

For simplicity, we concentrate on the feature space $\mathcal{X} = [0,1]^n$, the $n$-dimensional
hypercube, so that $\mathcal{D} = [0,1]^n \times \{-1,+1\}$. Denote by $\mathcal{H}$ the finite set of decision
stumps on the finite dataset $D$ induced by using the so-called midpoint rule. In this
rule, we project $D$ along each feature dimension $i$ and create a decision stump $h$ based
on the midpoint between any pair of distinct consecutive examples $x_i^{(l)} < x_i^{(l+1)}$
(in sorted order along dimension $i$) with different labels $y^{(l)} \neq y^{(l+1)}$, such that,
denoting the corresponding midpoint by $x'_i = \frac{x_i^{(l)} + x_i^{(l+1)}}{2}$, we can define the
decision stump as $h(x) = \mathrm{sign}(x_i - x'_i)$.


We eliminate from $\mathcal{H}$ any dominated hypothesis; that is, we do not need to consider
any $h \in \mathcal{H}$ such that there exists another $h' \in \mathcal{H}$, with the property that
$h'(x^{(l)}) \neq y^{(l)} \implies h(x^{(l)}) \neq y^{(l)}$ for all $l = 1, \ldots, m$. Denote the resulting effective set by $\mathcal{E}$.

Further, denote by $\mathcal{U}_T = \bigcup_{t=1}^{T} \{h_t\}$ the set of unique decision stumps actually selected
by Optimal AdaBoost from $\mathcal{E}$ after $T$ rounds.
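
The construction of the stump class and of its effective subset can be sketched as follows. This is a minimal sketch under assumptions that go beyond the text: the two constant hypotheses are added explicitly, both signs of every midpoint stump are kept, stumps that make exactly the same mistakes on the dataset are merged (keeping the first one encountered), and a stump is treated as dominated only when another stump's mistake set is a strict subset of its own. The function names and the key format `(dimension, threshold, sign)` are hypothetical.

```python
def midpoint_stumps(X, y):
    """Map (dim, threshold, sign) -> tuple of predictions on the dataset.

    X is a list of m feature vectors, y a list of m labels in {-1, +1}.
    """
    m, n = len(X), len(X[0])
    # Assumption: include the constant hypotheses h = +1 and h = -1.
    stumps = {(None, None, +1): tuple([+1] * m),
              (None, None, -1): tuple([-1] * m)}
    for i in range(n):
        order = sorted(range(m), key=lambda l: X[l][i])
        for a, b in zip(order, order[1:]):
            # Consecutive, distinct feature values with different labels.
            if X[a][i] != X[b][i] and y[a] != y[b]:
                thr = (X[a][i] + X[b][i]) / 2.0
                for s in (+1, -1):
                    stumps[(i, thr, s)] = tuple(
                        s if X[l][i] > thr else -s for l in range(m))
    return stumps

def mistake_set(preds, y):
    """Indices of the examples that a hypothesis misclassifies."""
    return frozenset(l for l, (p, yl) in enumerate(zip(preds, y)) if p != yl)

def effective_set(stumps, y):
    """Drop every stump whose mistakes strictly contain another stump's."""
    by_mistakes = {}
    for key, preds in stumps.items():
        by_mistakes.setdefault(mistake_set(preds, y), key)  # merge duplicates
    sets = list(by_mistakes)
    return {by_mistakes[s]: stumps[by_mistakes[s]]
            for s in sets if not any(other < s for other in sets)}
```

For example, with one feature, inputs 0.125, 0.375, 0.625, 0.875, and alternating labels -1, +1, -1, +1, the sketch produces 8 stumps, of which 3 remain after removing dominated ones.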


Problem 1 (Bounding the Number of Effective Stumps) Given a measurable space
$(\mathcal{D}, \Sigma, P)$, a dataset $D \sim P$, and the number of rounds of Optimal AdaBoost $T$,
provide non-trivial upper and lower bounds on $|\mathcal{H}|$, $|\mathcal{E}|$, and $|\mathcal{U}_T|$.


Trivial upper and lower bounds are $1 \leq |\mathcal{U}_T| \leq |\mathcal{E}| \leq |\mathcal{H}| \leq 2(n(m - 1) + 1)$.


Conjecture 4 (Logarithmic Growth on Unique Stumps) For $\mu$-almost all probability
spaces $(\mathcal{D}, \Sigma, P)$ and any dataset $D \sim P$, there exist $m', T' > 0$, such that if
$m > m'$ and $T > T'$, we have $E\left[\, |\mathcal{U}_T| \,\middle|\, D \right] < (\log T + 1)^c$, for some $c \in [1, 2)$,
$P$-almost surely.
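
A simple way to probe the conjectured growth empirically is to track the number of unique stumps selected so far and compare it against the conjectured bound. The sketch below is only a heuristic check under assumptions: it looks at a single run rather than the conditional expectation in the conjecture, it takes the logarithm to be natural (the base is not specified above), and the function names are hypothetical. The input `selected` is any sequence of hashable identifiers of the stumps chosen at rounds 1, 2, and so on.

```python
import math

def running_unique_counts(selected):
    """Return the number of unique stumps selected after each round."""
    seen, counts = set(), []
    for h in selected:
        seen.add(h)
        counts.append(len(seen))
    return counts

def log_growth_bound(T, c):
    """Conjectured bound (log T + 1)^c, assuming the natural logarithm."""
    return (math.log(T) + 1.0) ** c

def rounds_exceeding_bound(selected, c, t_min=1):
    """Rounds T > t_min at which the unique count is not below the bound."""
    counts = running_unique_counts(selected)
    return [T for T, u in enumerate(counts, start=1)
            if T > t_min and u >= log_growth_bound(T, c)]
```

Averaging the running counts over many random initializations of the example weights would give a closer, though still finite-sample, analogue of the expectation in the conjecture.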